Using taxonomic data from NCBI taxonomy and KEGG’s functional ontology we have built a graph representation of both databases
MATCH
(k:ko{ko:'ko:K11258'})
MATCH
pathway=(taxon2:superkingdom)<-[:childof*]-(taxon:Taxon)--(contig:contigs)--(k)--(cpd:cpd)--(k)--(m:module)--()--(p:pathway)
RETURN
pathway
LIMIT
1;
Purple for Taxonomic Ranks, Green for KOs, Blue for Compounds, yellow for KEGG Pathways and grey for KEGG Modules. Here the KO K11258, acetolactate synthase II small subunit, (in green, left) has a contig (purple) assigned to it via blastX using DIAMOND MEGAN. Below’s the cypher query to generate the above query.
For more information about the fields within the nodes and edges/relationships please refer to the omics github repo
We can annotate which contigs are captured within pAss’s Max Diversity Region and apply the necessary processing to carry out diveristy analysis such as filtering for full spanning contigs with coverage above predefined values based on simulation analyses.
df = fread("/export2/home/uesu/simulation_fr_the_beginning/reAssemble/everybodyelse/out/mdrcontigs.txt") %>% setNames(c("contig", "mdr"))
write.csv(df, "/export2/home/uesu/simulation_fr_the_beginning/reAssemble/everybodyelse/out/mdrcontigs.csv", row.names=FALSE)
dbquery("
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///home/wesley/mdrcontigs.csv' AS MAPPING
MATCH (c:contigs{contig:MAPPING.contig})
SET c.mdr = 1
")
Example if we query ko:K03043, DNA-directed RNA polymerase subunit beta [EC:2.7.7.6].
Naively if we just search for the number of contigs/genes captured within the MDR of the this KO.
dbquery(
"MATCH
(c:contigs{bin:'ko:K03043'}), (k:ko{ko:'ko:K03043'})
WHERE
c.mdr = 1
RETURN
k.ko as KO,
k.simulated as simulated,
count(c) as MDR,
k.definition as definition"
)
After carrying out processing we can retrieve the surviving contigs
dbquery("
MATCH
(k:ko {ko:{ko}})
WITH
k.ko as ko,
k.threshold as cutoff
MATCH
(c:contigs {bin:ko})
WHERE
c.mdr = 1 AND c.spanning =1 and toInt(c.readnum) > cutoff
RETURN
ko,
count(c) as remaining;
", list(ko='ko:K03043'))
Before we begin, we’ll have to load the library and connect to the database
library(MetamapsDB)
connect("<yourNeo4jServer>")
With this we can carry out several simple operations.
For a given contig of interest, in this case a contig found in the K07219 bin, K07219:contig00050, we can
koname(koID)
koname('K18481')
cpdname(cpdID)
cpdname('C00001')
contigInfo(ko, contig)
contigInfo(ko='K00343', contig='contig00001')
#contigInfo(ko='K18481', contig='contig00050')
And Find which taxa it has been assigned to based on diamond + MEGAN
contigInfo(ko, contig, withAnno=TRUE)
#koname('K18481')
#At the highest possible resolution, with a minimum score of 50
linked2contig = contigInfo(ko='K18481', contig='contig00005', withAnno=TRUE)
linked2contig
Show the taxonomic path path2kingdom(taxaID) all the way till Superkingdom rank.
path2kingdom(linked2contig$taxid) %>% make.data.frame
addKOProperty(ko, list(property=value))
Here we used the addKOProperty function which lets you insert properties on the KO node.
scgDF = MetamapsDB::scg %>% head(n=1) %>% lapply(function(koi) { addKOProperty(gsub("ko:", "", koi), list(scg=1))})
dbquery("
UNWIND
{koname} as KOSS
MATCH
(k:ko{ko:KOSS.ko})
RETURN
k.ko,
k.simulated,
k.definition
", params = MetamapsDB::scg %>%
lapply(function(x){ list(ko=x) }) %>% list(koname = .)
) %>% make.data.frame
annotateContigs.taxonomy(path2assignmentTable)
#We could build relationship between the contigs and the taxon it was assigned to
diamondAnnotation = "/home/wesley/newbs29_neo4j" #have to scp to cancer b4 running this
annotateContigs.taxonomy(diamondAnnotation) #sometimes there'll be a mismatch and you'll not be able to find the taxid cause it isnt in the taxonomic database
do not mess up there’s 2 two different types of assignments one is the origin of the contig ie. which bin was it constructed from 2nd is the diamond annotation
newblerDir= "/w/data/newbler/"
dynList = MetamapsDB::scg %>% gsub("ko:", "", .) %>% mclapply(dynamicThreshold, root=newblerDir, mc.cores=20)
plotDF = dynPlots(dynList, F)
plotDF$df %>% select(ko:remaining.cutoff, -remaining) %>% unique %>%
apply(1, function(row){
query = sprintf("match (k:ko{ko:'ko:%s'}) set k.threshold = %s return k.ko, k.threshold", row['ko'], as.integer(row['cutoff']))
dbquery(query)
})
library(cowplot)
load("/w/out/plotDF.rda")
ggdraw() +
draw_plot(plotDF$p, 0, 0.5, width=0.9, height=0.5)+
draw_plot(plotDF$details[[1]]$plot + theme(legend.position="top") + ggtitle("") + guides(fill= guide_legend(nrow =3)), 0, 0.06, width=0.5, height=0.4)+
draw_plot(plotDF$violin, 0.5, 0.06, width=0.4, height=0.4)+
draw_plot_label(c("A", "B", "C"), c(0, 0, 0.5), c(1, 0.4, 0.4), size=15)
You could do several interesting analyses with the MetamapsDB package
the pathways() function lists out the pathways in KEGG.
path2ko(pathwayID) finds all KOs belonging to the same pathway
grepgraph(koID) and prettifyGraph constructs metabolic graph based on given KOs (vector) for plotting
grepgraph_cpd constructs metabolic graph based on the given CPDs (vector), similar to graphgraph but is meant for cpds
nitrogenMetabolism = pathways() %>%
filter(grepl("Nitrogen", pathwayName )) %$%
path2ko(pathwayID) %$%
grepgraph(KO)
Enzyme subunits are given their own KO ids, so when you naivly plot the pathway you might get multiple nodes representing the same enzyme. With contractMetab you could simplify the graph.
nitrogenMetabolism.simplified = contractMetab(nitrogenMetab)
nitrogenMetabolism %>% prettifyGraph %>% plot()
nitrogenMetabolism.simplified %>% prettifyGraph %>% plot()
You could plot an interactive version of the metabolic network using
prettifyGraph coupled with ig2ggplot and plotly package’s ggplotly
p = nitrogenMetabolism.simplified %>% prettifyGraph %>% ig2ggplot(., dfOnly=FALSE)
ggplotly(p)
findTrio trio.local
Combs the metabolic pathway graph for Trios where the middle KO is dominated by a single Genera/OTU`
dbquery(cypherQueryStr, parameters)
cypherQuery must return a table structure not nodes.
When you run aggregated cypher queries you often run into the problem of a column made of lists if you just naively ran the dbquery function you can pipe the output into the make.data.frame function make.data.frame
igraph2gexf
For more details see Official tutorial for warming up
When we first generate a omics database using the omics Dockerimage, the nodes are not indexed, which makes searching slow.
See R code below for building index of commmonly searched IDs eg. ko, contig, taxid, kind of like primary Keys in a SQL database.
dbquery(query="CREATE INDEX ON :ko(ko)", justPost = TRUE)
dbquery(query="CREATE INDEX ON :contigs(contig)", justPost = TRUE)
dbquery(query="CREATE INDEX ON :contigs(bin)", justPost = TRUE)
dbquery(query="CREATE INDEX ON :Taxon(taxid)", justPost = TRUE)
dbquery(query="CREATE INDEX ON :cpd(cpd)", justPost = TRUE)
You can check the status by running :schema in the console